step size parameter
Supplement: Proximity Operator of the Matrix Perspective Function and its Applications Joong-Ho Won Department of Statistics Seoul National University wonj@stats.snu.ac.kr A Proofs A.1 A key lemma
Proofs of both Theorems 2 and 4 are based on the following key lemma, Lemma A.1. To prove this lemma, we begin by recalling the definition of directional derivatives. F (x + t h) F (x) t if the limit exists. Now we can prove the lemma: Proof of Lemma A.1. The following lemma shows a representation of an element of this set in terms of M: Lemma A.3.
Dynamical softassign and adaptive parameter tuning for graph matching
Shen, Binrui, Niu, Qiang, Zhu, Shengxin
This paper studies a framework, projected fixed-point method, for graph matching. The framework contains a class of popular graph matching algorithms, including graduated assignment (GA), integer projected fixed-point method (IPFP) and doubly stochastic projected fixed-point method (DSPFP). We propose an adaptive strategy to tune the step size parameter in this framework. Such a strategy improves these algorithms in efficiency and accuracy. Further, it guarantees the convergence of the underlying algorithms. Some preliminary analysis based on distance geometry seems to support that the optimal step size parameter has a high probability of 1 when graphs are fully connected. Secondly, it is observed that a popular projection method, softassign, is sensitive to graphs' cardinality(size). We proposed a dynamical softassign algorithm that is robust to graphs' cardinality. Combining the adaptive step size and the dynamical softassign, we propose a novel graph matching algorithm: the adaptive projected fixed-point method with dynamical softassign. Various experiments demonstrate that the proposed algorithm is significantly faster than several other state-of-art algorithms with no loss of accuracy.
Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent
Takabe, Satoshi, Wadayama, Tadashi
Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent Satoshi Takabe † and Tadashi Wadayama Nagoya Institute of Technology, Gokiso, Nagoya, Aichi, 466-8555, Japan, {wadayama, s_takabe}@nitech.ac.jp † RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, 103-0027, Japan Abstract --Deep unfolding is a promising deep-learning technique in which an iterative algorithm is unrolled to a deep network architecture with trainable parameters. In the case of gradient descent algorithms, as a result of the training process, one often observes the acceleration of the convergence speed with learned non-constant step size parameters whose behavior is not intuitive nor interpretable from conventional theory. In this paper, we provide a theoretical interpretation of the learned step size of deep-unfolded gradient descent (DUGD). We first prove that the training process of DUGD reduces not only the mean squared error loss but also the spectral radius related to the convergence rate. Next, we show that minimizing the upper bound of the spectral radius naturally leads to the Chebyshev step which is a sequence of the step size based on Chebyshev polynomials. The numerical experiments confirm that the Chebyshev steps qualitatively reproduce the learned step size parameters in DUGD, which provides a plausible interpretation of the learned parameters. Additionally, we show that the Chebyshev steps achieve the lower bound of the convergence rate for the first-order method in a specific limit without learning parameters or momentum terms. I NTRODUCTION Deep unfolding [10], [12] is a promising deep learning approach whose architecture is based on existing iterative algorithms with tuning parameters such as step sizes in gradient descent (GD). The recursive structure of the algorithm is unrolled to a deep network and some parameters are embedded into the network. These parameters can be trained using standard deep learning techniques such as back propagation and stochastic GD if all the processes in the algorithm are differentiable. One notable advantage of deep unfolding is the acceleration of the convergence speed that results from tuning parameters compared with the original algorithm. Embedding proper trainable parameters also offers a flexible network structure to the algorithm that is applicable, for example, to inverse problems with/without prior information [26]. Recently, theoretical aspects of deep unfolding have also been investigated [5], [21], [23]. MSE performance (upper) and learned step size parameters {γ t} 24 t 0 (lower) of DUGD (circles) and GD with a constant step size (cross marks) when (n,m) (300, 600).
Learned Step Size Quantization
Esser, Steven K., McKinstry, Jeffrey L., Bablani, Deepika, Appuswamy, Rathinakumar, Modha, Dharmendra S.
We present here Learned Step Size Quantization, a method for training deep networks such that they can run at inference time using low precision integer matrix multipliers, which offer power and space advantages over high precision alternatives. The essence of our approach is to learn the step size parameter of a uniform quantizer by backpropagation of the training loss, applying a scaling factor to its learning rate, and computing its associated loss gradient by ignoring the discontinuity present in the quantizer. This quantization approach can be applied to activations or weights, using different levels of precision as needed for a given system, and requiring only a simple modification of existing training code. As demonstrated on the ImageNet dataset, our approach achieves better accuracy than all previous published methods for creating quantized networks on several ResNet network architectures at 2-, 3- and 4-bits of precision.